
    Tracing Communications and Computational Workload in LJS (Lennard-Jones with Spatial Decomposition)

    LJS (Lennard-Jones with Spatial decomposition) is a molecular dynamics application developed by Steve Plimpton at Sandia National Laboratories [1]. It performs thermodynamic simulations of a system containing a fixed, large number (millions) of atoms or molecules confined within a regular, three-dimensional domain. Since the simulations model interactions at the atomic scale, the computations carried out in a single timestep (iteration) correspond to femtoseconds of real time; a meaningful simulation of the evolution of the system's state therefore typically requires thousands of timesteps or more. The particles in LJS are represented as material points subjected to forces resulting from interactions with other particles. While the general case involves N-body solvers, LJS implements only pair-wise interactions, using the derivative of the Lennard-Jones potential for each particle pair to evaluate the acting forces. The velocities and positions of the particles are updated by integrating Newton's equations of motion (classical molecular dynamics). The interaction range depends on the type of problem modeled; LJS focuses on short-range forces, implementing a cutoff distance rc beyond which interactions are ignored. The O(N²) computational complexity characteristic of systems with long-range interactions is thereby substantially alleviated.

    LJS deploys spatial decomposition of the domain volume to distribute the computations across the available processors of a parallel computer. The decomposition uniformly divides the parallelepiped containing all particles into volumes of equal size, as close in shape to a cube as possible, and assigns each of the resulting cells to a CPU. Correct computation requires that the positions of some particles residing in neighboring cells (how many depends on the value of rc) be known to the local process. This information is exchanged in every timestep via explicit communication with the neighboring nodes in all three dimensions (for details see [2]). LJS also takes advantage of Newton's third law to calculate the force only once per particle pair; if the particles involved belong to cells located on different processors, the results are forwarded to the other node in a "reverse communication" phase.

    Besides the communications occurring in every iteration, additional messages are sent once every preset number of timesteps. Their purpose is to adjust the cell assignments of particles as they move. To minimize the overhead of constructing particle neighbor lists, LJS replaces rc with an extended cutoff radius rs (rs > rc), which accounts for possible particle movement between list updates. Due to the relatively small impact of that phase on the overall behavior of the application, we ignored it in our analysis.
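
    As a concrete (and heavily simplified) illustration of the force computation described above, the C sketch below evaluates pairwise Lennard-Jones forces inside the cutoff rc and exploits Newton's third law to visit each pair only once. The all-pairs double loop and all names (particle, lj_forces, eps, sigma) are our own illustrative choices, not LJS code; LJS itself iterates over neighbor lists within the spatial decomposition.

        #include <stddef.h>

        /* Illustrative particle record; LJS's actual data layout differs. */
        typedef struct { double x, y, z; double fx, fy, fz; } particle;

        /* Accumulate pairwise Lennard-Jones forces for all pairs within the
         * cutoff rc.  The force is the derivative of the potential
         * U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6); Newton's third law
         * lets us update both particles of a pair in one visit. */
        void lj_forces(particle *p, size_t n, double eps, double sigma, double rc)
        {
            double rc2 = rc * rc;
            for (size_t i = 0; i < n; i++)
                for (size_t j = i + 1; j < n; j++) {
                    double dx = p[i].x - p[j].x;
                    double dy = p[i].y - p[j].y;
                    double dz = p[i].z - p[j].z;
                    double r2 = dx*dx + dy*dy + dz*dz;
                    if (r2 > rc2) continue;          /* outside cutoff: ignored */
                    double s2 = sigma * sigma / r2;  /* (sigma/r)^2 */
                    double s6 = s2 * s2 * s2;        /* (sigma/r)^6 */
                    /* |F|/r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 */
                    double f = 24.0 * eps * s6 * (2.0 * s6 - 1.0) / r2;
                    p[i].fx += f * dx;  p[j].fx -= f * dx;
                    p[i].fy += f * dy;  p[j].fy -= f * dy;
                    p[i].fz += f * dz;  p[j].fz -= f * dz;
                }
        }

    In LJS the inner loop would run over a neighbor list built with the extended radius rs rather than over all j > i, and pairs spanning two processors would trigger the "reverse communication" phase described above.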

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore, with multiple memory/processor nodes on each chip, and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven, multithreaded, split-transaction processing. MIND is designed to operate either in conjunction with conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
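
    The toy C sketch below is entirely our own construction, not the MIND design; it is meant only to make the phrase "message-driven processing" concrete. An arriving parcel names an action and a local memory target, and the node dispatches the work locally, split-transaction style, rather than a remote caller blocking on a round trip.

        #include <stdio.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical parcel: a message that carries work to the memory it
         * targets instead of moving data to a central CPU. */
        typedef struct {
            uint32_t action;   /* index of the handler to invoke        */
            uint64_t target;   /* local memory address the action is on */
            uint64_t operand;  /* small payload                         */
        } parcel;

        typedef void (*handler_fn)(uint64_t target, uint64_t operand);

        static void do_increment(uint64_t target, uint64_t operand) {
            *(uint64_t *)(uintptr_t)target += operand;  /* act on local state */
        }

        /* Illustrative handler table; a real system would register many actions. */
        static handler_fn handlers[] = { do_increment };

        /* Message-driven loop: each arriving parcel triggers local work,
         * so no remote caller waits on a reply. */
        static void node_dispatch(const parcel *queue, size_t n) {
            for (size_t i = 0; i < n; i++)
                handlers[queue[i].action](queue[i].target, queue[i].operand);
        }

        int main(void) {
            uint64_t counter = 0;
            parcel q[2] = {
                { 0, (uint64_t)(uintptr_t)&counter, 5 },
                { 0, (uint64_t)(uintptr_t)&counter, 7 },
            };
            node_dispatch(q, 2);
            printf("counter = %llu\n", (unsigned long long)counter);  /* 12 */
            return 0;
        }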

    Analysis, Tracing, Characterization and Performance Modeling of Select ASCI Applications for BlueGene/L Using Parallel Discrete Event Simulation

    Caltech's Jet Propulsion Laboratory (JPL) and Center for Advanced Computing Research (CACR) are conducting application and simulation analyses of Blue Gene/L [1] in order to establish the range of effectiveness of the architecture in performing important classes of computations and to determine the design sensitivity of the global interconnect network in support of real-world ASCI application execution.

    Continuum Computer Architecture for Nano-scale and Ultra-high Clock Rate Technologies

    Continuum computer architecture (CCA) is a non-von Neumann architecture that offers an alternative to conventional structures as digital technology evolves towards the nano-scale and the ultimate flat-lining of Moore's law. Coincidentally, it also defines a model of architecture particularly well suited to logic families that exhibit ultra-high clock rates (> 100 GHz), such as rapid single flux quantum (RSFQ) gates. CCA eliminates the concept of the "CPU" that has dominated computer architecture since its inception more than half a century ago, and establishes a new local element that merges the properties of state storage, state transfer, and state operation. A CCA system architecture is a simple multidimensional organization of these elemental blocks and physically may be considered a new family of cellular computer. But CCA differs dramatically from conventional cellular automata: while both deliver emergent global behavior from the aggregation of local rules and their ensuing operation, the emergent behavior of CCA is a global, general-purpose model of parallel computation, as opposed to simply mimicking some limited phenomenon, such as heat and mass transfer, as conventional cellular automata do. This paper presents the motivation and foundational concepts of CCA and exposes key issues for further work.
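
    To make that contrast concrete, the toy C program below implements the conventional-cellular-automaton baseline the text argues CCA goes beyond: each cell updates from purely local neighbor state under one fixed, diffusion-like rule. It is our own illustration, not the CCA design; CCA's elements would instead compose into general-purpose computation rather than a single hard-wired phenomenon.

        #include <stdio.h>
        #include <string.h>

        #define N 8

        /* One step of a conventional cellular automaton: every cell computes
         * its next state from purely local information (itself and its two
         * neighbors on a ring).  The rule mimics heat diffusion -- exactly
         * the kind of limited, fixed-function emergent behavior the paper
         * distinguishes CCA from. */
        static void ca_step(const double cur[N], double next[N]) {
            for (int i = 0; i < N; i++) {
                double left  = cur[(i + N - 1) % N];
                double right = cur[(i + 1) % N];
                next[i] = cur[i] + 0.25 * (left - 2.0 * cur[i] + right);
            }
        }

        int main(void) {
            double a[N] = { 1, 0, 0, 0, 0, 0, 0, 0 }, b[N];
            for (int t = 0; t < 4; t++) {      /* watch the pulse spread */
                ca_step(a, b);
                memcpy(a, b, sizeof a);
            }
            for (int i = 0; i < N; i++) printf("%.3f ", a[i]);
            printf("\n");
            return 0;
        }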

    A General Overview of One's Medical and Teaching Profession


    MPI-IO Implementation Strategies for the Cenju-3

    The lack of a portable parallel I/O interface limits the development of scientific applications. MPI-IO is the first widespread attempt to alleviate this problem. Its efficient implementation requires the developer to face and solve several software design and interface issues. Our paper outlines strategies that may be helpful in this task. Although originally targeted at the NEC Cenju-3, our considerations are applicable to other message-passing platforms as well.

    1 Introduction. An initiative led by NASA Ames and the IBM Watson Research Center has resulted in the creation of MPI-IO, which defines a portable interface for parallel I/O. Currently, MPI-IO is officially incorporated into MPI-2 [4], an ambitious extension of the original MPI. This paper summarizes our experiences with the development of an MPI-IO system for the NEC Cenju-3 supercomputer. The Cenju-3 features a multi-level switch and a NORMA (no remote memory access) architecture. Research supported by a grant from NEC Corporation. The Cenju/DE operating sy..
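
    For readers unfamiliar with the interface, here is a minimal C example of MPI-IO as standardized in MPI-2: every rank writes a disjoint block of one shared file at an explicit, rank-dependent offset. It illustrates only the portable, standard API (MPI_File_open, MPI_File_write_at, MPI_File_close) and is not drawn from our Cenju-3 implementation; the file name "out.dat" and the block size are arbitrary.

        #include <mpi.h>
        #include <stdio.h>

        #define COUNT 4  /* ints written per rank */

        int main(int argc, char **argv) {
            int rank, buf[COUNT];
            MPI_File fh;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            for (int i = 0; i < COUNT; i++)
                buf[i] = rank * COUNT + i;       /* this rank's contribution */

            /* All ranks collectively open one shared file. */
            MPI_File_open(MPI_COMM_WORLD, "out.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY,
                          MPI_INFO_NULL, &fh);

            /* Explicit-offset write: no seek state, so ranks need no
             * coordination to write their disjoint blocks. */
            MPI_Offset off = (MPI_Offset)rank * COUNT * sizeof(int);
            MPI_File_write_at(fh, off, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

            MPI_File_close(&fh);
            MPI_Finalize();
            return 0;
        }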